exp: general-agent by mikasenghaas · Pull Request #2525 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-17T20:02:22Z

Summary

- Add general-agent to pyproject.toml (envs list, workspace members, uv sources) and pull pytest-asyncio into the dev group so the env's tests are runnable.
- Add public configs/general_agent/ with three RLM configs using general-agent-solver-rlm, all logging to the general-agent-debug wandb project:
  - rl_qwen3_0p6b.toml: single-GPU smoke test
  - rl_qwen3_4b.toml: 4 train + 4 infer GPUs, max_steps=200
  - rl_qwen3_30b_a3b.toml: multi-node (1 train + 1 infer, dp=2 / tp=4), max_steps=400
- Bump deps/research-environments to c752781 (was origin/main at time of write; main has since advanced, so re-bump before merge if a refresh is wanted). Env version bumps:
  - ddbc 0.1.1 → 0.1.2
  - ddbc_rlm 0.1.5 → 0.1.6
  - deepdive 0.2.7 → 0.2.9
  - deepdive_rlm 0.2.11 → 0.2.13
  - general_agent 0.1.0 → 0.1.4
  - opencode_deepdive 0.1.15 → 0.1.16
  - rlm_deepdive 0.2.3 → 0.2.4
  - rlm_swe 0.3.4 → 0.4.2
- Bump configs/private submodule pointer (now a merge commit 70c3503 that joins the PR's behavior-learning RESULTS writeups with main's rlm5 X-Session-ID header cleanup).
- Skip vf-eval-style TOMLs (detected via top-level eval list) in tests/unit/test_configs.py::test_load_configs so non-entrypoint configs don't fail validation.
- Document exp/ as the branch prefix for experiment branches in AGENTS.md.

Verification

uv sync --all-extras rebuilds general-agent==0.1.4; entry point general-agent-solver-rlm resolves and vf.load_environment("general-agent-solver-rlm") returns a ComposableEnv.
uv run pytest tests/unit/test_configs.py — 106 passed (covers all three new configs/general_agent/*.toml).

Note

Low Risk
Mostly new TOMLs, dependency wiring, and a targeted config-test skip; no changes to core training or auth paths in this diff.

Overview
Wires the general-agent research environment into the repo and adds RL experiment configs for general-agent-solver-rlm on Qwen3 at 0.6B (smoke), 4B, and 30B-A3B scales, all targeting the general-agent-debug W&B project.

Packaging: general-agent is added to the envs extra, uv workspace members/sources, and uv.lock (new editable general-agent==0.1.4; lock also bumps deepdive / opencode-deepdive versions shown in the diff). Dev deps gain pytest-asyncio for async env tests.

Configs: New configs/general_agent/rl_qwen3_{0p6b,4b,30b_a3b}.toml tune steps, seq length, GPU/deployment layout, orchestrator batch/rollouts, and inference parallelism for each model size.

Tests / docs: tests/unit/test_configs.py skips TOMLs with a top-level eval list (vf-eval, not prime-rl entrypoints). AGENTS.md documents exp/ branch prefix for experiment work.

Reviewed by Cursor Bugbot for commit 278ed64. Bugbot is set up for automated code reviews on this repo.

Picks up #1395 which fixes the ParsedToolCall subscription bug in renderer_client.from_native_response — previously raised 'ParsedToolCall' object is not subscriptable on every rollout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
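
For readers unfamiliar with this error class, here is a small sketch of how such a message arises. The ParsedToolCall below is a stand-in dataclass with hypothetical fields (name, arguments), not the real type used by renderer_client.from_native_response, and the actual fix in #1395 may look different.

```python
from dataclasses import dataclass


@dataclass
class ParsedToolCall:
    # Stand-in type; the field names are assumptions for illustration only.
    name: str
    arguments: str


def tool_name_buggy(call: ParsedToolCall) -> str:
    # Treats the object like a dict, which a plain dataclass does not support.
    return call["name"]  # raises TypeError


def tool_name_fixed(call: ParsedToolCall) -> str:
    # Attribute access works on the dataclass instance.
    return call.name


call = ParsedToolCall(name="search", arguments="{}")
try:
    tool_name_buggy(call)
except TypeError as exc:
    message = str(exc)  # the "not subscriptable" error quoted above
```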

4B-Instruct hallucinated tool names and gave up after a few errors (reward ~0.4% at step 2). Try the thinking variant which is better at structured tool-use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cting reward Set behavior_judge_model + behavior_reward_alpha=0.0 so the judge runs and behavior_<key> metrics get logged, but final_reward stays equal to task_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ics in prompt run - prompt run: enable judge with alpha=0.0 so behavior_<key> metrics get logged but final_reward stays equal to task_reward (same setup as baseline) - all four configs: max_steps 1000 → 200 to keep ablations bounded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…m-id fix Picks up PrimeIntellect-ai/research-environments@769298b1 which forwards PRIME_TEAM_ID as X-Prime-Team-ID on behavior judge requests, so the judge bills the team balance instead of the user's personal balance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
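
The header forwarding described here can be sketched as follows. Only the PRIME_TEAM_ID variable and the X-Prime-Team-ID header name come from the commit message; the judge_headers function and its behavior when the variable is unset are assumptions.

```python
import os
from typing import Mapping


def judge_headers(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: build extra headers for behavior judge requests."""
    headers: dict[str, str] = {}
    team_id = env.get("PRIME_TEAM_ID")
    if team_id:
        # Forward the team so the request is billed to the team balance.
        headers["X-Prime-Team-ID"] = team_id
    return headers
```

Leaving the header out when the variable is unset would keep the previous behavior of billing the user's personal balance.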

Previous runs (cp=1, 32K) saw 6-9% truncation rate and output_tokens hitting the 32K cap on long trajectories. Double the seq_len and max_model_len; cp=2 keeps per-rank activation memory flat under the 2x context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
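
The arithmetic behind "keeps per-rank activation memory flat" can be checked quickly. This sketch assumes 32K means 32 * 1024 tokens and that context parallelism splits each sequence evenly across cp ranks, so activation memory per rank scales with seq_len / cp; tokens_per_rank is a hypothetical helper.

```python
def tokens_per_rank(seq_len: int, cp: int) -> int:
    """Tokens each context-parallel rank holds, assuming an even split."""
    if seq_len % cp != 0:
        raise ValueError("seq_len must be divisible by cp")
    return seq_len // cp


before = tokens_per_rank(32 * 1024, cp=1)  # previous runs: 32K context, cp=1
after = tokens_per_rank(64 * 1024, cp=2)   # this change: 2x context, cp=2
print(before, after)  # 32768 32768
```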

…bmodule The general-agent + behavior-learning configs are not meant for the public prime-rl repo. Move them into the research-configs submodule mounted at configs/private/ so they share access controls with the rest of our internal experiment configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e eval-config helper - Bump deps/research-environments to origin/main HEAD (general-agent 0.1.4) - Add configs/general_agent/{rl_qwen3_0p6b_debug,rl_qwen3_4b,rl_qwen3_30b_a3b}.toml using the general-agent-solver-rlm env - Rename is_vf_eval_config -> is_eval_config in tests/unit/test_configs.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mikasenghaas added 11 commits May 17, 2026 20:01

exp: general-agent

c7fd27c

docs: update PR description guidance

b8c33de

Revert "docs: update PR description guidance"

001711d

This reverts commit b8c33de.

feat(general-agent): add RLM behavior learning configs

1017b19

feat(general-agent): tune behavior reward shaping

2153495

feat(general-agent): log behavior judge state for audits

0409c35

feat(general-agent): document split behavior rewards

818957d

feat(general-agent): document pruned behavior metrics

b814b5c

docs(general-agent): record behavior judge calibration runs

fabcacc

feat(general-agent): define behavior learning ablations

a560650

chore(general-agent): move behavior learning configs

c1d7e07

mikasenghaas changed the title ~~exp: general-agent~~ exp: general-agent + behavior-learning May 18, 2026

mikasenghaas and others added 17 commits May 18, 2026 07:27

exp(behavior-learning): add labels, prime monitor, remove cp from abl…

fd80d03

…ation configs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

exp(behavior-learning): source ~/.env before run to fix auth

abbe7ec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(slurm): source ~/.env before uv run in single-node RL template

3703dc0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

exp(behavior-learning): rename prefix, pin rlm_ref, prime sandbox kil…

995fd00

…l in pre-run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

revert(slurm): drop source ~/.env from single-node RL template

94f1d05

Source ~/.env directly in the shell before uv run rl instead; env vars propagate to sbatch via --export=ALL. Reverts 3703dc0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

exp(behavior-learning): log behavior metrics in prompt run without af…

13d721c

…fecting reward Same change as baseline: enable judge for metrics but alpha=0.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(deps): bump research-environments to include final_reward fix

58dc920

Picks up eaaabf3c which makes final_reward use state.get() so judge failures don't zero out task_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
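
A rough sketch of why state.get() matters here, and why alpha=0.0 leaves final_reward equal to task_reward. The additive shaping formula and the state keys task_reward and behavior_reward are assumptions for illustration; the environment's real final_reward (including its solution gate) is not shown in this PR.

```python
from typing import Any, Mapping


def final_reward(state: Mapping[str, Any], alpha: float) -> float:
    """Assumed form: task reward plus an alpha-weighted behavior reward."""
    task = float(state.get("task_reward") or 0.0)
    # If the judge failed and never wrote a score, state.get() returns None
    # and we fall back to 0.0 instead of raising KeyError and losing `task`.
    behavior = float(state.get("behavior_reward") or 0.0)
    return task + alpha * behavior


# The judge ran but alpha=0.0: the behavior score is logged, not rewarded.
metrics_only = final_reward({"task_reward": 1.0, "behavior_reward": 0.5}, alpha=0.0)
# The judge failed: the task reward survives.
judge_failed = final_reward({"task_reward": 1.0}, alpha=0.5)
```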

chore(deps): bump research-environments to log un-gated behavior_reward

ba08c3f

Picks up fdca6d76 which logs behavior_reward as the raw judge mean (independent of task_reward) and moves the solution gate into final_reward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

exp(behavior-learning): extend max_steps to 400 for phase-3 continuation

26347cb

All 4 phase-2 runs completed step 200 with promising trajectories. Extend max_steps to 400 to continue training from the step_200 checkpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into exp/general-agent

1c2f61d

# Conflicts: # deps/verifiers

mikasenghaas changed the title ~~exp: general-agent + behavior-learning~~ exp: general-agent May 22, 2026

mikasenghaas and others added 6 commits May 22, 2026 10:08

chore(deps): bump configs/private to include phase-3 behavior-learnin…

fa63b17

…g RESULTS.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

exp(general-agent): rename 0p6b config, point all configs at general-…

89156ce

…agent-debug wandb project Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(deps): bump configs/private to expand behavior-learning RESULTS

8a6c42c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(deps): bump configs/private for tightened behavior-learning RES…

503861e

…ULTS Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

exp(general-agent): drop num_workers/max_retries/tool_call_parser, se…

54aee40

…t step counts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mikasenghaas requested a review from samsja May 22, 2026 04:50

mikasenghaas marked this pull request as ready for review May 22, 2026 04:51

Merge branch 'main' into exp/general-agent

278ed64

Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # configs/private

samsja approved these changes May 25, 2026

mikasenghaas merged commit 0057f3b into main May 25, 2026
18 checks passed

mikasenghaas merged 35 commits into main from exp/general-agent
